Don't worry about the code you'll see here; it is part of the effort to produce the results. Simply read the text and analyze the tables and graphs. Everything is explained according to the results the code presents.
On October 29, 2023, a rainy Sunday, a test was held for the position of administrative technician according to Notice No. 151/2023-PRH. The test took place at the nostalgic State University of Maringá. *I say nostalgic because it was there that I obtained my Bachelor's degree in Music a few years ago, and it was a good time, ah!
I honestly thought I would do well on this test, and that was one of the factors that made me extremely nervous, considering that I had come close but missed the mark in previous exams. However, this time, the subjects were ones I was familiar with. *I mentioned the rainy day, not that it's relevant to the work you're about to see, but because I want to emphasize that this aspect worsened my nervousness due to the traffic we had to face (on that day, my wife drove), even though we had ample time to arrive... well, I don't need to say much more. If you're an exam taker or have been through exams that could decide something crucial for you, even in school, then you'll understand.
I took the test, and Portuguese was a bit challenging, you know? *I usually do well in Portuguese, but I knew this time it wouldn't be my strong suit. However, I was sure I aced the rest, except for mathematics, which is my weakest point (as we'll see in the analyses I did here).
The preliminary objective test scores – published after the appeal period, during which candidates could, metaphorically, challenge the exam board in hopes of changing the answer key for a question – were released on November 22, 2023, in Notice No. 351/2023-PRH - Objective Test Result. *With this result in hand, I thought:
"Well, since I'm learning data analysis, I'll practice some tools, ask for help from ChatGPT, and do a simple analysis of the preliminary position in which each candidate might be based on the data obtained in the notice with preliminary results. *Who knows, maybe I'll compare a few things here and there?"
And thus, this work is born.
I won't claim to understand every line of code here at the moment of delivering this project, because my focus was on the analysis, on telling a story I already had in mind. But I can say I have enough tools to study and understand what's here, as everything presented comes from a scope where I've studied the basics and know how to search. So this project also serves as a study tool: I can reverse engineer the points I don't master and learn from what I've done.
What helped me most in achieving this were the courses in the COURSERA IBM Applied AI Professional Certificate track (which I took through a financial aid application; among the courses in this track, I've already obtained 4 certificates), as well as free courses on prompt engineering from Udemy and YouTube, for dealing with ChatGPT. These courses teach how to extract the best from the natural language processing tool that is "GPT."
I can't claim that everything I did here is correct; this is an attempt to demonstrate what I could analyze at first glance. It's my way of telling a story about the preliminary performance of the candidates, and I apologize for any inconsistencies. I also count on the help of those who know more than I do to improve my techniques.
Regarding the data, of course, during the journey in this project, I thought:
"Wow! If I had access to each alternative that each candidate answered for each question, I could identify more specific strengths and weaknesses."
Imagine if we could also know what each candidate has for breakfast, who prepares their meals, if they work, if they only study; how much could we extract? But, well, I leave this task to those who sought or "made" these data (if they exist) and will surely handle them better than I, a mere aspiring data analyst.
Given these considerations, I really would like to say that I wrote this introduction last in the project. I knew exactly what I was going to do before starting all of this, but I didn't know exactly how it would turn out. However, the ending was what I expected.
Thank you for embarking on this journey with me!
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
%pip install nbformat --upgrade
ANNEX I TO NOTICE Nº151/2023-PRH
Let's look at the topics covered in the test:
PORTUGUESE LANGUAGE
MATHEMATICS
BASIC COMPUTER SKILLS
BASIC KNOWLEDGE OF LEGISLATION AND STATUTE OF CHILDREN AND ADOLESCENTS
Here, I am just performing some basic data readings without any analyses or insights.
file = r'D:\Jeanco\Meus projetos\notas_prova_uem\resultado_final_csv.csv'
# Data presented with all candidate names, subjects, etc.
df = pd.read_csv(file, skiprows=1)
df
| | INSCRIÇÃO | CANDIDATO | L.PORT | MAT | INFO | LEG | EST | NOTA |
|---|---|---|---|---|---|---|---|---|
| 0 | 4859 | Abgail de Souza | 1,50 | 0,25 | 1,00 | 0,25 | 0,50 | 3,50 |
| 1 | 4751 | Ábia Morais Silva | 1,25 | 0,00 | 0,50 | 0,75 | 0,50 | 3,00 |
| 2 | 12493 | Abinair Trindade Miranda Valerio | 1,00 | 0,25 | 0,25 | 0,50 | 0,50 | 2,50 |
| 3 | 17346 | Abner Fellipe Beliato | 1,25 | 0,00 | 0,75 | 0,25 | 0,50 | 2,75 |
| 4 | 13377 | Abner Mariano | 2,75 | 1,00 | 1,00 | 2,00 | 0,50 | 7,25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9699 | 10446 | Zildeni Nunes Acacio Cassidori | 1,25 | 0,00 | 0,00 | 0,75 | 0,25 | 2,25 |
| 9700 | 19599 | Zilma Fernandes dos Santos Soares | 1,75 | 0,25 | 0,00 | 1,25 | 0,50 | 3,75 |
| 9701 | 12953 | Zingara Facco Rodrigues | 1,00 | 0,25 | 0,25 | 0,75 | 0,50 | 2,75 |
| 9702 | 8830 | Zuleica Silva Marques de Lima | 2,75 | 0,50 | 0,75 | 0,50 | 0,50 | 5,00 |
| 9703 | 4453 | Zuleika Shiraishi Kagueyama | 1,25 | 0,50 | 0,75 | 1,00 | 0,50 | 4,00 |
9704 rows × 8 columns
Below, I will perform operations to rename the columns to facilitate data manipulation.
# Rename Columns
df.columns = ['Inscricao', 'Candidato', 'Port', 'MAT', 'INF', 'LEG', 'ECA', 'Nota Final']
# Standardize names to lowercase and separate by _
df.columns = df.columns.str.lower().str.replace(' ', '_')
# Viewing the data with renamed columns
df
| | inscricao | candidato | port | mat | inf | leg | eca | nota_final |
|---|---|---|---|---|---|---|---|---|
| 0 | 4859 | Abgail de Souza | 1,50 | 0,25 | 1,00 | 0,25 | 0,50 | 3,50 |
| 1 | 4751 | Ábia Morais Silva | 1,25 | 0,00 | 0,50 | 0,75 | 0,50 | 3,00 |
| 2 | 12493 | Abinair Trindade Miranda Valerio | 1,00 | 0,25 | 0,25 | 0,50 | 0,50 | 2,50 |
| 3 | 17346 | Abner Fellipe Beliato | 1,25 | 0,00 | 0,75 | 0,25 | 0,50 | 2,75 |
| 4 | 13377 | Abner Mariano | 2,75 | 1,00 | 1,00 | 2,00 | 0,50 | 7,25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9699 | 10446 | Zildeni Nunes Acacio Cassidori | 1,25 | 0,00 | 0,00 | 0,75 | 0,25 | 2,25 |
| 9700 | 19599 | Zilma Fernandes dos Santos Soares | 1,75 | 0,25 | 0,00 | 1,25 | 0,50 | 3,75 |
| 9701 | 12953 | Zingara Facco Rodrigues | 1,00 | 0,25 | 0,25 | 0,75 | 0,50 | 2,75 |
| 9702 | 8830 | Zuleica Silva Marques de Lima | 2,75 | 0,50 | 0,75 | 0,50 | 0,50 | 5,00 |
| 9703 | 4453 | Zuleika Shiraishi Kagueyama | 1,25 | 0,50 | 0,75 | 1,00 | 0,50 | 4,00 |
9704 rows × 8 columns
Quickly and more clearly, let's also take a look at how many rows and columns are in this table.
# Checking the number of rows and columns
df.shape
(9704, 8)
The first value in parentheses indicates the number of rows, and the second value indicates the number of columns.
Now I'll check the data types, which is a common practice in data analysis. Basically, I need to know if I'm dealing with numbers or text, but this is a very simplified explanation, so don't worry about this part.
# Observing Data Types
df.dtypes
inscricao      int64
candidato     object
port          object
mat           object
inf           object
leg           object
eca           object
nota_final    object
dtype: object
The results above are not quite the data types I expect, as I mentioned. Don't worry about this part; I will convert the data to the formats I expect them to have.
# Converting Data Types
cols_to_convert = ['port', 'mat', 'inf', 'leg', 'eca', 'nota_final']
# Function to convert values to float, or keep them unchanged if not possible
def convert_to_float(value):
    try:
        return float(value.replace(',', '.'))  # Replace the decimal comma with a dot
    except (ValueError, AttributeError):  # AttributeError covers non-string values
        return value

# Apply the function only to the grade columns
for column in cols_to_convert:
    df[column] = df[column].apply(convert_to_float)
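As a side note: since the grades come with decimal commas, pandas can also parse them directly at load time via `read_csv`'s `decimal` parameter, skipping the manual conversion. A minimal sketch on a hypothetical two-row sample (the separator and layout of the real CSV may differ):

```python
import io

import pandas as pd

# Hypothetical sample in the same "1,50"-style decimal format as the notice data.
sample = "candidato;nota\nFulano de Tal;1,50\nBeltrana de Tal;0,25\n"

# decimal=',' parses the comma as a decimal point directly,
# so no string replacement is needed afterwards.
df_sample = pd.read_csv(io.StringIO(sample), sep=';', decimal=',')
print(df_sample['nota'].dtype)  # float64
print(df_sample['nota'].sum())  # 1.75
```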
I'll explain the next step in a way a non-expert can understand. A quick check is performed to see whether there are any null values in each column. As we can see, every column shows "0" (zero) null values, meaning there is no "missing data." To make it even clearer, take the "candidato" column as an example: the '0' next to it indicates that every row of that column has a value (zero missing values, to be even more redundant).
# Checking for any null values in the table
df.isnull().sum()
inscricao     0
candidato     0
port          0
mat           0
inf           0
leg           0
eca           0
nota_final    0
dtype: int64
Let's "write" a table that presents basic statistics. Consider that each column represents a specific subject of the test, with the final grade (the sum of the grades in every subject) in the last column. The first column, where we see count, mean, std, etc., names each statistic; of these, I want to highlight count, mean, min, and max.
# Table to Present Basic Statistics such as Mean, Median, Maximum, Minimum, etc.
df[['port','mat','inf','leg','eca','nota_final']].describe()
| | port | mat | inf | leg | eca | nota_final |
|---|---|---|---|---|---|---|
| count | 9704.000000 | 9704.000000 | 9704.000000 | 9704.000000 | 9704.000000 | 9704.000000 |
| mean | 1.922352 | 0.420213 | 0.770945 | 0.954941 | 0.428329 | 4.496780 |
| std | 0.748732 | 0.335230 | 0.312252 | 0.444426 | 0.130710 | 1.383657 |
| min | 0.250000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.250000 |
| 25% | 1.250000 | 0.250000 | 0.500000 | 0.750000 | 0.250000 | 3.500000 |
| 50% | 1.750000 | 0.250000 | 0.750000 | 1.000000 | 0.500000 | 4.500000 |
| 75% | 2.500000 | 0.750000 | 1.000000 | 1.250000 | 0.500000 | 5.500000 |
| max | 4.500000 | 1.250000 | 1.250000 | 2.000000 | 0.500000 | 9.000000 |
Take some time to analyze the table above. Look at the "nota_final" column, for example. For this column, the "mean" (average) is approximately 4.5, a relatively low grade: we had 9704 candidates, and the test was worth a total of 10.0. There is much more information that can be extracted from this simple statistical table, but that goes beyond the scope of this project.
# Average of Subjects
subjects = ['port', 'mat', 'inf', 'leg', 'eca']
means = df[subjects].mean()
print(means)
port    1.922352
mat     0.420213
inf     0.770945
leg     0.954941
eca     0.428329
dtype: float64
Before continuing, I will create a copy of the original data, just for safety, in case I need to perform any manipulation that could alter the original data. For cases like this, I will use the created copy:
# Create a General Copy of the Original DataFrame
df_copy = df.copy()
As I mentioned, I found the Portuguese test quite challenging, folks. When I came across that lengthy 170-line text, full of peculiar terms like "de chofre" and "palejam," I confess I felt a bit out of my comfort zone. The writing style, more akin to a college entrance exam than a job competition, made the task even more complex in my opinion.
But now, having observed the averages, an interesting realization: the subject in which I struggled the most was surprisingly the one that had the highest overall average. Portuguese Language leads, even with the complexity of the text.
However, let's analyze an important point: the number of questions. If we consider the distribution of grades according to the number of questions in each subject, we observe that grades for test-related topics were distributed as follows:
| SUBJECT | QUANTITY * VALUE = TOTAL VALUE OF QUESTIONS |
|---|---|
| Portuguese Language | 20 x 0.25 = 5.00 |
| Mathematics | 05 x 0.25 = 1.25 |
| Basic Computer Skills | 05 x 0.25 = 1.25 |
| Basic Knowledge of Legislation | 08 x 0.25 = 2.00 |
| Statute of Children and Adolescents | 02 x 0.25 = 0.50 |
| Total Questions | 40 x 0.25 = 10.0 |
Here we have something to consider, and looking at the quantity of questions in percentage terms reveals other insights:
Let's use approximate values to make visualization easier.
Portuguese Language has 20 questions, each worth 0.25 points, totaling 5 points. The average in this subject was 1.9, which corresponds to 7.6 questions – that is, on average about 38% of the questions answered correctly. On the other hand, the ECA subject has 2 questions, each worth 0.25 points, totaling 0.5 points. The average for this subject was approximately 0.42, which corresponds to about 1.7 questions, representing roughly 84% of the questions answered correctly.
Now, when we directly compare these percentages, is it evident that the average in ECA is higher than in Portuguese Language? Nooooo! What is evident is that the proportion of correct answers is higher, not the average score. This percentage approach takes the total possible number of correct answers in each subject into account and offers a more intuitive understanding of candidates' performance relative to the number of questions in each discipline.
So, going back to "the averages," the average of 1.9 in Portuguese Language is low, considering exclusively the total possible points (5.0), but it remains the highest average in the dataset.
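The two percentage conversions above can be checked with a few lines of arithmetic (using the rounded averages from the text):

```python
# Portuguese: average ~1.9 points out of a possible 5.0 (20 questions x 0.25)
port_avg, port_max = 1.9, 20 * 0.25
print(round(port_avg / port_max * 100, 1))  # 38.0 -> ~38% of the possible points

# ECA: average ~0.42 points out of a possible 0.5 (2 questions x 0.25)
eca_avg, eca_max = 0.42, 2 * 0.25
print(round(eca_avg / eca_max * 100, 1))  # 84.0 -> ~84% of the possible points
```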
Interestingly - or not -, I made six mistakes in this subject, including errors in transferring answers to the final answer sheet, and I also made two mistakes in mathematics. Even with this personal challenge, the analysis suggests that candidates were better prepared for Portuguese Language than I initially perceived, at least in the sense that it had the highest average among all subjects.
Now, let's talk a bit more about "guessing." The idea that the higher average in Portuguese Language is due to guessing does not seem to hold. Each question has an independent, constant probability of being answered correctly by chance, no matter how many questions the test has, so the expected proportion of correct answers is the same on a long test as on a short one. What test length changes is the spread of individual outcomes: the probability of getting every question right by guessing shrinks as questions are added. Therefore, it is not correct to state that guessing all questions would give a better chance of a higher average on a longer test than on a shorter one.
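To back this reasoning with numbers, here is a small simulation (my own sketch, not part of the notice data), assuming 5 alternatives per question, i.e., a 1-in-5 chance of guessing each one correctly:

```python
import random

random.seed(42)

def mean_fraction_correct(n_questions, n_candidates=20000, p=0.2):
    """Average fraction of correct answers when every candidate guesses
    every question (assumed 1-in-5 chance per question)."""
    total = 0.0
    for _ in range(n_candidates):
        correct = sum(random.random() < p for _ in range(n_questions))
        total += correct / n_questions
    return total / n_candidates

# The expected proportion of correct answers is ~20% for both a
# 20-question subject and a 2-question subject; test length changes
# only the spread of individual results, not the average.
print(round(mean_fraction_correct(20), 2))
print(round(mean_fraction_correct(2), 2))
```

Both printed values land near 0.20, which is the point: pure guessing does not favor the longer Portuguese test on average.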
Well, in case it's still not clear, let's visualize it with a graph:
# Bar Chart Function
def create_bar_chart(ax, labels, values, max_values, colors, title, xlabel, ylabel, legend_label):
    # Iterate over subjects
    for i, label in enumerate(labels):
        max_value = max_values[i]
        value = values[i]
        # Add a background bar up to the maximum possible grade
        ax.bar(label, max_value, color='lightgray', label=legend_label if i == 0 else "")
        # Add a filled bar up to the height corresponding to the average
        ax.bar(label, value, color=colors[i])
        # Annotate the maximum possible grade
        ax.annotate(f'{max_value:.2f}', xy=(label, max_value), xytext=(0, 5),
                    textcoords='offset points', ha='center', va='bottom', color='black', fontsize=8)
    # Add labels and title
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    # Add legend
    ax.legend()
# Data from describe
data = {
'Subject': ['Portuguese', 'Math', 'CompSci', 'Legislation', 'ECA'],
'Average': [1.922352, 0.420213, 0.770945, 0.954941, 0.428329],
}
# Data for the number of questions and maximum possible grade per subject
questions_max = {
'Subject': ['Portuguese', 'Math', 'CompSci', 'Legislation', 'ECA'],
'Number of Questions': [20, 5, 5, 8, 2],
'Maximum Possible Grade': [5.00, 1.25, 1.25, 2.00, 0.50],
}
# Bar Chart for Averages
fig, ax = plt.subplots(figsize=(10, 5))
create_bar_chart(ax, data['Subject'], data['Average'], questions_max['Maximum Possible Grade'],
['blue', 'orange', 'green', 'red', 'purple'], 'Average per Subject', 'Subject', 'Grade', 'Maximum possible grade')
plt.show()
Let's understand together, once and for all, what the above graph clarifies even more!
Each color represents the overall average achieved by all candidates, as we have seen before in the basic statistics table.
The gray bars? Well, those are like a reminder, showing the maximum possible grade in each subject. A visual way to understand how candidates performed in relation to the overall average, considering the maximum they could achieve. All clear so far?
Note that Portuguese has the highest average in the set. This coincides with it also being the subject with the highest maximum possible score (5.0, the highest attainable in any subject on the test), but it is worth noting that there is no direct causal relationship between these two facts.
# Data for the number of questions per subject
questions = {
'Subject': ['Portuguese', 'Math', 'CompSci', 'Legislation', 'ECA'],
'Number of Questions': [20, 5, 5, 8, 2],
}
# Calculate percentage of correct answers based on average and number of questions
data['Percentage Average'] = [(m / (q * 0.25)) * 100 for m, q in zip(data['Average'], questions['Number of Questions'])]
# Bar Chart for Percentage of Correct Answers
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(data['Subject'], data['Percentage Average'], color=['blue', 'orange', 'green', 'red', 'purple'])
# Add labels and title
ax.set_title('Percentage of Correct Answers per Subject')
ax.set_xlabel('Subject')
ax.set_ylabel('Percentage of Correct Answers')
# Add values above the bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom')
# Add grid
ax.grid(axis='y', linestyle='--', alpha=0.7)
# Adjust y-axis limits
plt.ylim(0, 100)
plt.show()
For the percentages of correct answers, as we saw, ECA is way ahead compared to Portuguese (and all other subjects). Portuguese is a bit "lower," right?
The graph really shows us what we had considered earlier, and it includes the exact percentage values.
Another point I want to consider is how subjects like Portuguese and Mathematics end up among the lowest percentages of correct answers. Let's highlight a discrepancy between common perception and actual results:
"Take that job entrance exam, it's high school stuff, all easy stuff!!!" Have you heard that before? I have, a lot!!!
Well, the percentage of correct answers revealed in the graph surprises us! Contrary to the common perception that subjects like Portuguese and Mathematics, considered 'basic' high-school material, would be easier, the numbers show another reality. The ECA, for example, a statute often underestimated, leads comfortably, demonstrating that familiarity doesn't always translate into success; in this analysis it proved to be the highest-performing subject.
These results point to the importance of valuing and staying updated in all areas, including those considered more basic. After all, the test reveals that what may seem 'easy' does not always correspond to a high percentage of correct answers.
We must keep in mind that the number of questions for each subject was different. But if the difficulty of the questions were maintained, the candidates were in the same conditions as on test day, and the number of questions were identical across subjects, would the result be very different? A good question, isn't it? Let's not dwell on it here and leave it for another work, but remember that a dataset can reveal much more and lets us explore larger, more complex questions.
At this moment, we are interested in the "top candidates". The notice provided for 17 vacancies, with 14 for general competition, 1 for people with disabilities, and 2 for Afro-descendants.
We will disregard the quotas, since the data used in our analysis does not give us enough information to list the candidates who could compete for those positions. Additionally, we will disregard age, a tiebreaker criterion outlined in the notice.
Now, based on the data in hand, we will indeed investigate which candidates had the highest scores and could be among the 17 who would occupy the available positions. We need to follow some steps, starting with how many candidates have a score greater than or equal to 6.0. Why? Because, according to the notice, the minimum passing score is 6.0, so everyone within this range is approved.
# Checking the number of scores greater than or equal to 6.0
major_or_equal_6 = df['nota_final'] >= 6.0
major_or_equal_6.sum()
1735
There are 1735 candidates who obtained this score, so we need to check what the scores, the actual numbers, are within this range:
# Checking the unique values within the range >= 6.0
unique_values = df.loc[major_or_equal_6, 'nota_final'].sort_values(ascending=False)
unique_values.unique()
array([9. , 8.75, 8.5 , 8.25, 8. , 7.75, 7.5 , 7.25, 7. , 6.75, 6.5 ,
6.25, 6. ])
The values presented are the unique scores within the interval >=6.0. Now, we need to know the quantity of each score. Notice that we have scores like 9.0, 8.75, and so on, but we still don't know how many values we have for each. So, let's also check that:
# Checking the quantity of each unique score
counts = unique_values.value_counts().sort_values(ascending=True)
# Transforming the score counts into a DataFrame
df_counts = pd.DataFrame(counts).reset_index()
df_counts.columns = ['nota_final', 'quantity']
# Displaying the DataFrame
df_counts
| | nota_final | quantity |
|---|---|---|
| 0 | 9.00 | 2 |
| 1 | 8.75 | 2 |
| 2 | 8.50 | 8 |
| 3 | 8.25 | 22 |
| 4 | 8.00 | 47 |
| 5 | 7.50 | 64 |
| 6 | 7.75 | 71 |
| 7 | 7.25 | 140 |
| 8 | 7.00 | 178 |
| 9 | 6.75 | 194 |
| 10 | 6.50 | 265 |
| 11 | 6.25 | 357 |
| 12 | 6.00 | 385 |
I'll create a chart where the horizontal axis represents each grade and the vertical axis represents how many candidates obtained that grade, just to make it more visual.
# Create bar chart
plt.figure(figsize=(10, 6))
plt.bar(df_counts['nota_final'], df_counts['quantity'], color='blue', width=0.2)
# Add labels and title
plt.xlabel('Final Grade')
plt.ylabel('Quantity')
plt.title('Frequency of Final Grades')
# Add values above the bars
for i, v in enumerate(df_counts['quantity']):
    plt.text(df_counts['nota_final'].iloc[i], v + 10, str(v), ha='center', va='bottom')
# Add grid
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
As we can see in the table (and chart) above, we have 2 candidates who scored 9.0 and 2 candidates who scored 8.75. Just out of curiosity, at first, let's see who these candidates are, their names.
# Filtering the original DataFrame for scores 9.0 and 8.75
specific_note = df[df['nota_final'].isin([9.0, 8.75])]
# Displaying only the columns "candidato" and "nota_final"
result = specific_note[['candidato', 'port', 'mat', 'inf', 'leg', 'eca', 'nota_final']]
# Resetting the index
result = result.reset_index(drop=True)
# Displaying the result
result.sort_values(by='nota_final', ascending=False)
| | candidato | port | mat | inf | leg | eca | nota_final |
|---|---|---|---|---|---|---|---|
| 2 | Jefferson Hyan Ferreira | 4.50 | 1.25 | 1.00 | 1.75 | 0.5 | 9.00 |
| 3 | Jonas Luís Rockenbach | 4.25 | 1.25 | 1.25 | 1.75 | 0.5 | 9.00 |
| 0 | Alan Christian Gimenez | 4.25 | 1.25 | 1.00 | 1.75 | 0.5 | 8.75 |
| 1 | Jackeline Santos Neves da Silva | 4.25 | 0.75 | 1.25 | 2.00 | 0.5 | 8.75 |
With this "small set of candidates," it's easy to tell who is in first place according to the tiebreaker criteria. However, once we start evaluating the other grades in the table, that manual work becomes unproductive and far from intuitive, so we need a more systematic approach that computes the final grades, applies the tiebreaker criteria, and then positions the candidates in a list. Let's review the tiebreaker criteria we are dealing with:
def calculate_placement(df, cutoff_score=6.0):
    # Keep only candidates at or above the cutoff, on a copy of the original DataFrame
    higher_or_equal_scores = df[df['nota_final'] >= cutoff_score].copy()
    # Recompute the final score as the sum of the subject grades (primary sort key)
    higher_or_equal_scores['nota_final_with_tiebreak'] = (
        higher_or_equal_scores['port'] +
        higher_or_equal_scores['mat'] +
        higher_or_equal_scores['inf'] +
        higher_or_equal_scores['leg'] +
        higher_or_equal_scores['eca']
    )
    # Sort by the final score, then by the tiebreaker subjects, all in descending order
    higher_or_equal_scores = higher_or_equal_scores.sort_values(
        by=['nota_final_with_tiebreak', 'nota_final', 'port', 'leg', 'mat'],
        ascending=[False, False, False, False, False]
    )
    # Add a column with the preliminary placement
    higher_or_equal_scores['preliminary placement'] = range(1, len(higher_or_equal_scores) + 1)
    # Set the 'preliminary placement' column as the index
    higher_or_equal_scores.set_index('preliminary placement', inplace=True)
    # Select the desired columns for the final DataFrame
    final_result = higher_or_equal_scores[['candidato', 'port', 'mat', 'inf', 'leg', 'eca', 'nota_final']]
    return final_result
# Example of using the function
df_final_result = calculate_placement(df)
df_final_result
| | candidato | port | mat | inf | leg | eca | nota_final |
|---|---|---|---|---|---|---|---|
| preliminary placement | |||||||
| 1 | Jefferson Hyan Ferreira | 4.50 | 1.25 | 1.00 | 1.75 | 0.5 | 9.00 |
| 2 | Jonas Luís Rockenbach | 4.25 | 1.25 | 1.25 | 1.75 | 0.5 | 9.00 |
| 3 | Jackeline Santos Neves da Silva | 4.25 | 0.75 | 1.25 | 2.00 | 0.5 | 8.75 |
| 4 | Alan Christian Gimenez | 4.25 | 1.25 | 1.00 | 1.75 | 0.5 | 8.75 |
| 5 | Naylor Moreira Batista | 4.25 | 0.75 | 1.00 | 2.00 | 0.5 | 8.50 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1731 | Matheus Luiz Pauka Siquieri | 1.75 | 0.75 | 1.25 | 1.75 | 0.5 | 6.00 |
| 1732 | Maycon Jose Marcelino | 1.75 | 0.75 | 1.25 | 1.75 | 0.5 | 6.00 |
| 1733 | Guilherme Cesar Ferreira Furtado | 1.75 | 1.25 | 1.00 | 1.50 | 0.5 | 6.00 |
| 1734 | Alexandre Carlos de Alencar Correa | 1.50 | 0.75 | 1.25 | 2.00 | 0.5 | 6.00 |
| 1735 | Gabriela de Almeida Barbosa | 1.50 | 1.00 | 1.25 | 1.75 | 0.5 | 6.00 |
1735 rows × 7 columns
Something we haven't discussed before: notice that there is a jump between positions 5 and 1731 (if you looked carefully, you'll have seen the same "jump" in earlier tables). pandas shows only the first and last rows of a large table, but believe me, all the candidates are in this set, okay? Now we have a table listing the candidates who scored 6.0 or higher, with their ranking in the first column, considering the highest final score and the tiebreaker criteria. Let's list the top 17 candidates.
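For the curious, that "jump" is just pandas truncating the display of long tables. A standalone sketch (unrelated to the exam data) confirms it and shows how to render every row:

```python
import pandas as pd

# A 100-row frame exceeds the default display.max_rows limit (60),
# so its text representation elides the middle rows with "...".
df_demo = pd.DataFrame({'nota': range(100)})
print('...' in repr(df_demo))  # True: middle rows are hidden

# Temporarily lifting the limit renders every row.
with pd.option_context('display.max_rows', None):
    print('...' in repr(df_demo))  # False: all 100 rows are shown
```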
df_top_17 = df_final_result.head(17)
df_top_17
| | candidato | port | mat | inf | leg | eca | nota_final |
|---|---|---|---|---|---|---|---|
| preliminary placement | |||||||
| 1 | Jefferson Hyan Ferreira | 4.50 | 1.25 | 1.00 | 1.75 | 0.50 | 9.00 |
| 2 | Jonas Luís Rockenbach | 4.25 | 1.25 | 1.25 | 1.75 | 0.50 | 9.00 |
| 3 | Jackeline Santos Neves da Silva | 4.25 | 0.75 | 1.25 | 2.00 | 0.50 | 8.75 |
| 4 | Alan Christian Gimenez | 4.25 | 1.25 | 1.00 | 1.75 | 0.50 | 8.75 |
| 5 | Naylor Moreira Batista | 4.25 | 0.75 | 1.00 | 2.00 | 0.50 | 8.50 |
| 6 | Elisangela Pacheco da Silva | 4.25 | 0.50 | 1.25 | 2.00 | 0.50 | 8.50 |
| 7 | Yukiko Sakomoto Belem | 4.25 | 0.50 | 1.25 | 2.00 | 0.50 | 8.50 |
| 8 | Felipe Pfeffer | 4.25 | 1.00 | 1.00 | 1.75 | 0.50 | 8.50 |
| 9 | Thais Nicoletti Silva | 4.25 | 1.00 | 1.25 | 1.50 | 0.50 | 8.50 |
| 10 | Allan Sorrilha Meira Barros | 4.00 | 1.25 | 1.00 | 1.75 | 0.50 | 8.50 |
| 11 | Vanessa Carvalho Fenelon | 4.00 | 1.25 | 1.00 | 1.75 | 0.50 | 8.50 |
| 12 | Bruna Barbara Polizer | 4.00 | 1.00 | 1.25 | 1.75 | 0.50 | 8.50 |
| 13 | Mateus Henrique Aparecido Primilla | 4.25 | 0.50 | 1.25 | 2.00 | 0.25 | 8.25 |
| 14 | Pedro Henrique de Souza Marques | 4.25 | 1.00 | 1.00 | 1.50 | 0.50 | 8.25 |
| 15 | Renan Constantino Colli | 4.25 | 1.00 | 1.00 | 1.50 | 0.50 | 8.25 |
| 16 | Robson dos Santos Mendonça | 4.25 | 1.00 | 1.00 | 1.50 | 0.50 | 8.25 |
| 17 | Gabriel Weber Maximowski | 4.25 | 0.75 | 1.25 | 1.50 | 0.50 | 8.25 |
Above, then, we have the names of those who will work for the state, at least according to the result of this preliminary score analysis :).
I'll create another chart just for us to visualize at what level a candidate ranked compared to another in each subject. Each bar represents a subject, and the vertical axis represents the score for each subject (remember that the maximum score for each subject is different, as we saw in a previous chart).
The created chart will be interactive. Each color represents a specific subject, as indicated in the legend next to the chart ("Discipline"). Initially, with the stacked bars, the chart doesn't make it easy to compare performance in each subject. However, we have an "effect": if we hover over the bars with our mouse, we'll see a "floating" box with the candidate's name, the subject's name, and the score achieved. In the "Discipline" box, on the side of the chart, we can "hide the subjects" by clicking on any of the colors; that subject will disappear from the chart. If we click again on the color that disappeared, it will reappear on the chart.
Try leaving only one color visible; it's much easier to compare performance between candidates that way. Give it a try and tell me if it isn't.
def create_bar_chart(df, title, user_index=None, user_name=None):
    # Select the first 17 rows of the resulting DataFrame
    df_top_17 = df.head(17).copy()
    # Add a row to represent the user's performance
    if user_index is not None:
        df_user = df[df.index == user_index].copy()
        df_user['candidato'] = user_name
        df_combined = pd.concat([df_top_17, df_user])
    else:
        df_combined = df_top_17
    # Create an interactive plot with Plotly
    fig = px.bar(
        df_combined,
        x='candidato',
        y=['port', 'mat', 'inf', 'leg', 'eca'],
        title=title,
        labels={'value': 'Grade', 'variable': 'Discipline'},
        hover_name='candidato',
        template='plotly',
    )
    # Add axis titles
    fig.update_layout(
        xaxis_title='Candidate',
        yaxis_title='Grade',
        yaxis=dict(visible=False)  # Hide the y-axis
    )
    # Adjust the margins around the plot
    fig.update_layout(
        margin=dict(l=50, r=20, t=40, b=20)
    )
    # Highlight the user's bar in red, if applicable
    if user_index is not None:
        df_combined.loc[df_combined.index == user_index, 'candidato'] = user_name
        fig.update_traces(marker=dict(color='rgba(255, 0, 0, 0.7)'), selector=dict(name=user_name))
    # Display the interactive plot
    fig.show()
# Show the first plot
create_bar_chart(df_top_17, 'Performance comparison by subject of the Top 17 candidates')
Let's see in what position I ended up on the list after the tiebreaker criteria were applied:
df_final_result[df_final_result['candidato'] == 'Jeanco Mateus de Oliveira Volfe']
| preliminary placement | candidato | port | mat | inf | leg | eca | nota_final |
|---|---|---|---|---|---|---|---|
| 63 | Jeanco Mateus de Oliveira Volfe | 3.5 | 0.75 | 1.25 | 2.0 | 0.5 | 8.0 |
My preliminary placement is 63.
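The ranking-with-tiebreakers step mentioned above can be sketched with pandas. This is only an illustrative sketch: the tiny DataFrame and the tiebreaker order (final score, then the Portuguese score) are assumptions for the example; the real tiebreaker criteria are the ones defined in the notice, and the notebook's actual data lives in `df_final_result`.

```python
import pandas as pd

# Hypothetical mini result set (the real notebook uses df_final_result)
df = pd.DataFrame({
    'candidato':  ['Ana', 'Bruno', 'Carla'],
    'port':       [4.00, 3.50, 4.00],
    'nota_final': [8.25, 8.25, 8.00],
})

# Rank by final score, breaking ties with the Portuguese score
# (the notice defines the actual tiebreaker order; this one is assumed)
ranked = (
    df.sort_values(['nota_final', 'port'], ascending=False)
      .reset_index(drop=True)
)
ranked.index = ranked.index + 1  # placements start at 1
ranked.index.name = 'preliminary placement'
print(ranked)
```

Here Ana and Bruno tie on `nota_final`, so the assumed Portuguese tiebreaker puts Ana first.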
Now I want to put my performance alongside the top performers. It's not going to be a very joyful thing to do, but... hehe. Here it goes: THE LAST COLUMN OF THE CHART represents my performance.
create_bar_chart(df_final_result, 'Scores by Subject for the Top 17 Candidates and My Performance Relative to All of Them', user_index=63, user_name='Jeanco Mateus de Oliveira Volfe')
I can't say I hit the mark, as there were few openings compared to my preliminary position (maybe I'll do another project with the final result), but I'm definitely improving with each exam I take and with every day of study. This is reflected in the score I achieved and in my growing confidence in what I know. If I know it, I know I know it; if I don't, then I have to guess (though if it's CESPE, depending on the situation, guessing isn't even an option; those who understand will get it! lol). And if I make a mistake thinking I got it right, I'll study to make fewer mistakes in the future.
Congratulations to everyone who really studied and competed in this exam, and congratulations to those who achieved a good preliminary score. If you're just starting in the world of competitions and got a very low score, I can say, don't get discouraged. I've been there, and I know how it is (and I understand that everyone has a different and often challenging reality). But with every summary, every page read, every "pen strike" and exam, you're getting closer to being one of the top performers in my future performance charts...
Be (keep being) excellent!
As I wrote this piece, I pondered how privileged I am to exist in a time with so many optimizing technologies. I was able to perform these simple analyses in about 5 hours of work, spread over two days. I must acknowledge the power of tools like ChatGPT, as well as the contribution of the data analysis community, which provides free videos, improves documentation, teaches, and more. A more experienced analyst might tell me this work is awful, and I wouldn't yet know enough to judge. I've been seeking, reading about the subject, asking questions, taking courses, and studying every day to improve. So, I'm grateful to those who can contribute positively.
The work fulfills its purpose in demonstrating some aspects of exam scores, subjects, and candidates. It certainly opened my mind to much more than I could have accomplished.
The analysis not only provided me with a deeper understanding of the exam results but also reinforced my commitment to continuous learning. Furthermore, I believe everyone should have some power to analyze data to have more accurate information. It's fascinating when we can discover something simple that we thought was one way but turns out to be diametrically opposite when we take an honest and "educated" approach using data.
I thank the online community, valuable courses, and ChatGPT (OpenAI) for their contribution to this journey. As I progress, I look forward to applying these skills to more complex challenges and continuing to explore the vast world of data analysis. May this small journey inspire others to discover the power of analysis and to transform data into meaningful information that encourages reflection and the change we desire in the world.